General Information about the dataset


Citation Request: This dataset is publicly available for research. The details are described in [Cortez et al., 2009]. Please include this citation if you plan to use this database:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

  1. Title: Wine Quality

  2. Sources Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009

  3. Past Usage:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

In the above reference, two datasets were created, using red and white wine samples. The inputs include objective tests (e.g. pH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model these datasets under a regression approach. The support vector machine model achieved the best results. Several metrics were computed: MAD, confusion matrix for a fixed error tolerance (T), etc. Also, we plot the relative importances of the input variables (as measured by a sensitivity analysis procedure).

  4. Relevant Information:

The two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant. So it could be interesting to test feature selection methods.

  5. Number of Instances: red wine - 1599; white wine - 4898.

  6. Number of Attributes: 11 + output attribute

Note: several of the attributes may be correlated, thus it makes sense to apply some sort of feature selection.

  7. Attribute information:

For more information, read [Cortez et al., 2009].

Input variables (based on physicochemical tests):
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)

Output variable (based on sensory data):
12 - quality (score between 0 and 10)

  8. Missing Attribute Values: None

  9. Description of attributes:

1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity: the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste

3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet

5 - chlorides: the amount of salt in the wine

6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density: the density of wine is close to that of water depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

11 - alcohol: the percent alcohol content of the wine

Output variable (based on sensory data): 12 - quality (score between 0 and 10)


Univariate Plots

Short overview of the used data:

## [1] 4898   14
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"              "qual"
## 'data.frame':    4898 obs. of  14 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : Factor w/ 7 levels "3","4","5","6",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ qual                : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##                                                                   
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##                                                        
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##                                                                        
##     alcohol      quality       qual      
##  Min.   : 8.00   3:  20   Min.   :3.000  
##  1st Qu.: 9.50   4: 163   1st Qu.:5.000  
##  Median :10.40   5:1457   Median :6.000  
##  Mean   :10.51   6:2198   Mean   :5.878  
##  3rd Qu.:11.40   7: 880   3rd Qu.:6.000  
##  Max.   :14.20   8: 175   Max.   :9.000  
##                  9:   5

The worst quality in the dataset is 3 and the best is 9. To get a better understanding of the quality ranking we plot a histogram.

Most values fall into quality groups 5 and 6. In our analysis we try to find a linear model that estimates the quality of the wine from the given parameters. Because the quality groups contain very different numbers of wines, I plot the histograms separately for each group.


Checking the means and medians, it looks like good quality wines have more alcohol than bad quality wines.

## Source: local data frame [7 x 3]
## 
##   quality mean_alcohol median_alcohol
## 1       3     10.34500          10.45
## 2       4     10.15245          10.10
## 3       5      9.80884           9.50
## 4       6     10.57537          10.50
## 5       7     11.36794          11.40
## 6       8     11.63600          12.00
## 7       9     12.18000          12.50
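
A table like the one above can be produced along the following lines. This is only a minimal sketch on a small made-up data frame (the name toy and all of its values are illustrative assumptions, not the wine data). The original analysis used dplyr for the grouping; the sketch uses base R's aggregate() so that it runs without extra packages.

```r
# Toy data standing in for the wine data frame (values are made up).
toy <- data.frame(
  quality = factor(c(5, 5, 5, 6, 6, 7, 7, 7)),
  alcohol = c(9.4, 9.5, 9.8, 10.4, 10.6, 11.2, 11.4, 11.9)
)

# Mean and median of alcohol within each quality group.
alcohol_by_quality <- aggregate(
  alcohol ~ quality,
  data = toy,
  FUN = function(x) c(mean = mean(x), median = median(x))
)

# aggregate() stores both statistics in one matrix column; flatten it
# into ordinary columns and give them readable names.
alcohol_by_quality <- do.call(data.frame, alcohol_by_quality)
names(alcohol_by_quality) <- c("quality", "mean_alcohol", "median_alcohol")
print(alcohol_by_quality)
```

The same pattern works for density, sulphates or fixed.acidity by swapping the column on the left-hand side of the formula.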

From the pH distribution I can't see any trend across the quality groups. It should be noted that the difference between max and min is only 1.1.

## [1] 1.1

## Source: local data frame [7 x 3]
## 
##   quality mean_density median_density
## 1       3    0.9948840       0.994425
## 2       4    0.9942767       0.994100
## 3       5    0.9952626       0.995300
## 4       6    0.9939613       0.993660
## 5       7    0.9924524       0.991760
## 6       8    0.9922359       0.991640
## 7       9    0.9914600       0.990300

Good wines tend toward a density of about 0.991 and bad ones toward about 0.995, but on the other hand the spread between min and max over the whole dataset is only 0.05187.

## [1] 0.05187

The chlorides distribution looks pretty much the same in each group.

## Source: local data frame [7 x 3]
## 
##   quality quantile_0.1_chlor quantile_0.9_chlor
## 1       3              0.022             0.0668
## 2       4              0.013             0.0646
## 3       5              0.009             0.0640
## 4       6              0.015             0.0570
## 5       7              0.012             0.0510
## 6       8              0.014             0.0560
## 7       9              0.018             0.0338
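
The 10% and 90% quantiles per quality group can be computed as sketched below. The data frame toy and its values are made up for illustration, and base R's tapply() is used here in place of the dplyr grouping from the original analysis.

```r
# Made-up chlorides values for two quality groups.
toy <- data.frame(
  quality   = factor(rep(c(5, 6), each = 5)),
  chlorides = c(0.030, 0.040, 0.045, 0.050, 0.120,
                0.025, 0.035, 0.040, 0.045, 0.060)
)

# One quantile per group; names = FALSE keeps the result a plain number.
q10 <- tapply(toy$chlorides, toy$quality, quantile, probs = 0.1, names = FALSE)
q90 <- tapply(toy$chlorides, toy$quality, quantile, probs = 0.9, names = FALSE)

chlor_quantiles <- data.frame(
  quality            = names(q10),
  quantile_0.1_chlor = as.vector(q10),
  quantile_0.9_chlor = as.vector(q90)
)
print(chlor_quantiles)
```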

You can see a right-skewed distribution (a long tail toward high values) in each group.


The same applies to sulphates; the mean and median values also look stable across the groups.

## Source: local data frame [7 x 3]
## 
##   quality  mean_sul median_sul
## 1       3 0.4745000       0.44
## 2       4 0.4761350       0.47
## 3       5 0.4822032       0.47
## 4       6 0.4911056       0.48
## 5       7 0.5031023       0.48
## 6       8 0.4862286       0.46
## 7       9 0.4660000       0.46

The total.sulfur.dioxide values are very high compared to the other input variables (see the min/max table below); the distribution looks stable in all classes.

## Source: local data frame [7 x 4]
## 
##   quality min_total.sulf mean_total.sulf max_total.sulf
## 1       3             19        170.6000          440.0
## 2       4             10        125.2791          272.0
## 3       5              9        150.9046          344.0
## 4       6             18        137.0473          294.0
## 5       7             34        125.1148          229.0
## 6       8             59        126.1657          212.5
## 7       9             85        116.0000          139.0

The free.sulfur.dioxide shows stable distributions in all classes with high differences between min and max.



You can see that approximately 50% of the values are between 0.8 and 0.9, which looks interesting.

Using a log10 transformation we can see two classes of wine, one with less sugar and another with more sugar.
You can find this pattern in all quality groups; it looks like sweetness on its own is no mark of quality.
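
Why the log10 transformation helps can be sketched with a handful of made-up sugar values (not the wine data). On the raw scale the low values are squeezed together near the origin; on the log10 scale the two clusters end up in clearly separated bins. hist() is called with plot = FALSE so that only the bin counts are returned.

```r
# Made-up residual sugar values (g / dm^3) forming a dry and a sweet cluster.
sugar <- c(1.1, 1.2, 1.4, 1.5, 1.6, 1.8, 2.0,
           8.0, 9.5, 11.0, 12.5, 14.0, 16.0, 19.0)

log_sugar <- log10(sugar)

# Bin the transformed values; an empty bin between the clusters
# is what shows up as the gap between the two modes.
h <- hist(log_sugar, breaks = seq(0, 1.5, by = 0.25), plot = FALSE)
print(data.frame(bin_start = head(h$breaks, -1), count = h$counts))
```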

Here it is very interesting that we have this spike at level 0.49; maybe we can find the reason for that.


The difference between max and min of the distribution is small and there are some spikes in the data.


The distribution shows a huge spike at approx. 6.8, as you can see in the median value.

## Source: local data frame [7 x 3]
## 
##   quality mean_fixed.acidity median_fixed.acidity
## 1       3           7.600000                  7.3
## 2       4           7.129448                  6.9
## 3       5           6.933974                  6.8
## 4       6           6.837671                  6.8
## 5       7           6.734716                  6.7
## 6       8           6.657143                  6.8
## 7       9           7.420000                  7.1

Summary of some facts: Looking at the mean values for alcohol, very good wines have about 12 %, very bad wines about 10 %.
The pH distributions show no clear trend across the quality groups.
Good wines tend toward a density of about 0.991, bad ones toward about 0.995.
The chlorides groups show that bad wines have higher values than better ones.


Univariate Analysis

What is the structure of your dataset?

There are 4898 different white wine variants with 12 variables (11 inputs plus quality):
  • fixed acidity
  • volatile acidity
  • citric acid
  • residual sugar
  • chlorides
  • free sulfur dioxide
  • total sulfur dioxide
  • density
  • pH
  • sulphates
  • alcohol
  • quality

The variable quality is an ordered factor. The grading scale runs from 0 to 10, but only the levels 3 to 9 occur in this dataset.

(worst) … (best)
quality: 3, 4, 5, 6, 7, 8, 9

Other observations:

The median of the quality is 6. In alcohol there is a spike at 9.5, and in residual sugar there is also a spike at 2.

Looking back at the quality data, 1457 white wines get quality 5, 2198 wines get a 6 and 880 wines get a 7. It is interesting to see that the distribution of quality is skewed.

The distributions of alcohol, sulphates, chlorides, residual sugar and volatile acidity are also skewed. That is my subjective impression from looking at the distribution charts.

What is/are the main feature(s) of interest in your dataset?

From what I know as a wine “expert”, pH, alcohol and sugar have a big impact on wine quality. So I guess these parameters should have the most impact on a model that predicts the quality of white wine; that's my personal opinion.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

By comparing the distributions and mean values, it could be that chlorides and density have an impact on quality.

Did you create any new variables from existing variables in the dataset?

First I changed the type of the variable quality from int to factor. In the dataset quality is the only categorical factor.
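
The conversion can be sketched as follows. The vector qual holds made-up scores, and fixing the levels to 3:9 is an assumption that mirrors the seven levels shown in the data overview. Converting back through as.character() recovers the original scores.

```r
# Made-up integer quality scores.
qual <- c(6L, 6L, 5L, 7L, 3L, 9L, 8L, 4L)

# Turn the integer scores into a categorical variable with levels 3..9.
quality <- factor(qual, levels = 3:9)
print(levels(quality))
print(table(quality))

# Going back to the scores: convert the labels, not the factor itself.
restored <- as.integer(as.character(quality))
print(restored)
```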


Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The histogram for the variable citric.acid looks strange because there is a spike at level 0.5.
I used dplyr to group the values per quality and calculate the mean values for some chosen parameters.
I plotted the data between the 1% and 99% quantiles so that extreme values do not hide the spikes in the plot.
For the parameter residual.sugar I used a log10 transformation to show that there are two groups of wines.
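
Restricting a plot to the range between the 1% and 99% quantiles can be sketched like this. The vector x is made up (98 regular values plus two extreme ones); filtering the data is one possible way to do it, and the original plots may instead have set axis limits.

```r
# Made-up values: 0.1, 0.2, ..., 9.8 plus two extreme points.
x <- c(1:98, 250, 400) / 10

# Lower and upper cut-offs at the 1% and 99% quantiles.
limits <- quantile(x, probs = c(0.01, 0.99))

# Keep only the values inside the cut-offs.
trimmed <- x[x >= limits[1] & x <= limits[2]]

print(limits)
print(range(trimmed))
```

A histogram of trimmed then spends its bins on the bulk of the data instead of on a few extreme points.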


Bivariate Plots Section

For a quick overview we created a correlation plot from all parameters:

Here are the parameters that have the highest correlation with quality:

The highest correlation is between residual.sugar and density.
In the univariate section we saw that the values of the parameters change between the quality groups. Now we try to find relations between the different parameters.

The highest positive correlations are found among free.sulfur.dioxide, total.sulfur.dioxide, residual.sugar and density; on the other side there are negative correlations between alcohol and total.sulfur.dioxide, residual.sugar, density and chlorides.
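
A correlation matrix like the one behind the plot can be computed with cor(). The sketch below uses a small made-up data frame with three of the columns (the values are illustrative only) and then looks up the pair with the largest absolute correlation.

```r
# Made-up values for three of the input variables.
toy <- data.frame(
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 11.0, 12.0, 12.5, 10.4),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 1.5, 1.2, 2.0, 7.0),
  density        = c(1.0010, 0.9940, 0.9951, 0.9956,
                     0.9912, 0.9895, 0.9900, 0.9949)
)

# Pearson correlation between every pair of columns.
corr <- cor(toy)
print(round(corr, 2))

# Ignore the diagonal (always 1) and find the strongest pair.
abs_corr <- abs(corr)
diag(abs_corr) <- 0
strongest <- which(abs_corr == max(abs_corr), arr.ind = TRUE)[1, ]
cat("Strongest pair:",
    rownames(abs_corr)[strongest["row"]], "and",
    colnames(abs_corr)[strongest["col"]], "\n")
```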

Now we print some boxplots for each input variable grouped by quality.

## [1] "Correlation between quality and alcohol:"
## [1] 0.4355747

Alcohol and quality have the highest correlation. In the boxplot you can see that with increasing quality the mean alcohol percentage also increases. Bad wines have less alcohol compared to good wines.


## [1] "Correlation between quality and pH:"
## [1] 0.09942725

There is a very small correlation between pH and quality; I would not use pH as a parameter for a linear model.


## [1] "Correlation between quality and density:"
## [1] -0.3071233

Density has the second highest (negative) correlation with quality, but I would not use this parameter in a linear model together with alcohol, because alcohol and density are highly correlated with each other (see the later comparison of alcohol and density). As shown in the univariate section, bad wines have a higher density than good wines.


## [1] "Correlation between quality and chlorides:"
## [1] -0.2099344

Chlorides have a lot of outliers (values far above the 75% quantile), especially for quality levels 5 and 6.
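
In a standard boxplot, points are drawn as outliers when they lie more than 1.5 times the interquartile range beyond the box, not merely above the 75% quantile. A minimal sketch with made-up chlorides values, using base R's boxplot.stats(), which applies this rule with its default coef = 1.5:

```r
# Made-up chlorides values with one clearly extreme point.
chlorides <- c(0.030, 0.036, 0.040, 0.043, 0.045, 0.050, 0.052, 0.200)

# stats: lower whisker, lower hinge, median, upper hinge, upper whisker.
# out:   the points beyond the whiskers, i.e. the plotted outliers.
bs <- boxplot.stats(chlorides)
print(bs$stats)
print(bs$out)
```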


## [1] "Correlation between quality and sulphates:"
## [1] 0.05367788

Sulphates have a small correlation with quality, with some outliers.


## [1] "Correlation between quality and total.sulfur.dioxide:"
## [1] -0.1747372

It looks like the better the wine quality, the higher the volatility of total.sulfur.dioxide; the correlation is not that high.


## [1] "Correlation between quality and free.sulfur.dioxide:"
## [1] 0.008158067

The correlation is weaker than for total.sulfur.dioxide and is the smallest correlation with quality, but the plot shows the same high outliers for good quality wines.


## [1] "Correlation between quality and residual.sugar:"
## [1] -0.09757683

residual.sugar has a small correlation with quality but a big correlation with the input variables alcohol and density; we take a closer look at it later.


## [1] "Correlation between quality and citric.acid:"
## [1] -0.009209091

citric.acid has the second smallest correlation with quality and has outliers in both directions.


## [1] "Correlation between quality and volatile.acidity:"
## [1] -0.194723

The correlation with the output variable quality (-0.19) is the fourth highest in absolute value, after alcohol, density and chlorides, which makes this variable a good candidate for a parameter in a linear model.


## [1] "Correlation between quality and fixed.acidity:"
## [1] -0.1136628

Last but not least fixed.acidity shows a small correlation.



In the next step we will check the scatterplot of the three highest correlations, as shown in the correlation matrix.

## [1] "Correlation between total.sulfur.dioxide and free.sulfur.dioxide:"
## [1] 0.615501

total.sulfur.dioxide and free.sulfur.dioxide have a correlation of 0.62, the third highest correlation, but unfortunately they have some of the smallest correlations with quality; for a linear model both can’t be used.


## [1] "Correlation between alcohol and density:"
## [1] -0.7801376

density and alcohol have the second highest (negative) correlation, meaning that the density falls as the percentage of alcohol increases; both variables have a high correlation with quality.


## [1] "Correlation between residual.sugar and density:"
## [1] 0.8389665

The highest correlation in the dataset is 0.84 between residual.sugar and density. I will take a closer look in the next section, Bivariate Analysis, by comparing the three input variables density, alcohol and residual.sugar.

## [1] "Correlation between qual and alcohol:"
## [1] 0.4355747

The biggest impact on quality comes from alcohol (positively correlated).


Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

residual.sugar and density have a high positive correlation compared to the other correlation coefficients, so I will drop one of these parameters from the linear model.

There is a very strong negative correlation of -0.78 between alcohol and density.

Quality is correlated with density, chlorides, volatile.acidity and alcohol.


Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The dataset has a high number of wines in quality groups 5 and 6. I was surprised that pH does not have more impact on quality.

What was the strongest relationship you found?

The strongest relationship for building a model to predict the quality of white wine is alcohol, with a correlation of 0.44.

Multivariate Plots Section


In the first graph we show the alcohol for each quality group in one chart.


To show the interesting relationship between alcohol, density and quality on the one hand and residual.sugar, density and quality on the other hand, we make this plot.


To answer the question of how sugar affects alcohol and density, we apply the log function to residual.sugar and get a nicer separation by color.


In the last picture I show the density function for alcohol, pH, density and chlorides, to show graphically what we did at the end of the univariate analysis with the group function.


## 
## Calls:
## lin: lm(formula = as.numeric(quality) ~ alcohol, data = wqw[, 2:13])
## lin2: lm(formula = as.numeric(quality) ~ alcohol + volatile.acidity, 
##     data = wqw[, 2:13])
## 
## =====================================
##                      lin      lin2   
## -------------------------------------
## (Intercept)        0.582***  1.017***
##                   (0.098)   (0.098)  
## alcohol            0.313***  0.324***
##                   (0.009)   (0.009)  
## volatile.acidity            -1.979***
##                             (0.110)  
## -------------------------------------
## R-squared             0.190     0.240
## adj. R-squared        0.190     0.240
## sigma                 0.797     0.772
## F                  1146.395   773.875
## p                     0.000     0.000
## Log-likelihood    -5839.391 -5681.776
## Deviance           3112.257  2918.264
## AIC               11684.782 11371.552
## BIC               11704.272 11397.538
## N                  4898      4898    
## =====================================
## 
## Call:
## lm(formula = as.numeric(quality) ~ ., data = wqw[, 2:13])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8348 -0.4934 -0.0379  0.4637  3.1143 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.482e+02  1.880e+01   7.881 3.98e-15 ***
## fixed.acidity         6.552e-02  2.087e-02   3.139  0.00171 ** 
## volatile.acidity     -1.863e+00  1.138e-01 -16.373  < 2e-16 ***
## citric.acid           2.209e-02  9.577e-02   0.231  0.81759    
## residual.sugar        8.148e-02  7.527e-03  10.825  < 2e-16 ***
## chlorides            -2.473e-01  5.465e-01  -0.452  0.65097    
## free.sulfur.dioxide   3.733e-03  8.441e-04   4.422 9.99e-06 ***
## total.sulfur.dioxide -2.857e-04  3.781e-04  -0.756  0.44979    
## density              -1.503e+02  1.907e+01  -7.879 4.04e-15 ***
## pH                    6.863e-01  1.054e-01   6.513 8.10e-11 ***
## sulphates             6.315e-01  1.004e-01   6.291 3.44e-10 ***
## alcohol               1.935e-01  2.422e-02   7.988 1.70e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7514 on 4886 degrees of freedom
## Multiple R-squared:  0.2819, Adjusted R-squared:  0.2803 
## F-statistic: 174.3 on 11 and 4886 DF,  p-value: < 2.2e-16

From my investigation I would choose the variables alcohol and volatile.acidity to build a linear model to predict quality. Compared with the linear model that uses all parameters (R-squared 0.28), our two parameters describe almost as much (R-squared 0.24).
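
One detail of the model calls above is worth checking: as.numeric() applied to a factor returns the internal level codes, not the labels. If quality is the factor with levels "3" to "9" from the data overview, the fitted response would be the quality score minus 2, which would be consistent with the small intercepts in the output. The sketch below uses made-up data with levels "5" to "8" (an offset of 4) to show that only the intercept is affected.

```r
# Made-up scores and alcohol values; only the mechanics matter here.
toy <- data.frame(
  qual    = c(5L, 5L, 6L, 6L, 6L, 7L, 7L, 8L),
  alcohol = c(9.2, 9.8, 10.1, 10.6, 11.0, 11.3, 12.1, 12.4)
)
toy$quality <- factor(toy$qual)  # levels "5", "6", "7", "8"

# as.numeric() on a factor yields the level codes 1..4, not the scores 5..8.
fit_codes  <- lm(as.numeric(quality) ~ alcohol, data = toy)
fit_scores <- lm(qual ~ alcohol, data = toy)

# Same slope and R-squared; the intercepts differ by the offset (4 here).
print(rbind(codes = coef(fit_codes), scores = coef(fit_scores)))
print(c(codes  = summary(fit_codes)$r.squared,
        scores = summary(fit_scores)$r.squared))
```

Slopes, R-squared and the model comparison are therefore unaffected; only predictions would need the offset added back, or the integer column qual could be used as the response directly.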


Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The most important parameters for predicting quality are alcohol and volatile.acidity, as shown in the linear model.

Were there any interesting or surprising interactions between features?

Yes, it was very surprising that all parameters that are highly correlated with quality also have a high correlation with alcohol, for example
quality - total.sulfur.dioxide - alcohol
quality - density - alcohol
quality - chlorides - alcohol

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, I built a linear model with the parameters alcohol and volatile.acidity.

The linear model has an R-squared of 0.24, which is very low: it explains only about 24% of the variance in quality.
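
What R-squared measures can be made concrete with a short sketch: it is one minus the ratio of the residual sum of squares to the total sum of squares. The vectors x and y below are made-up values, not the wine data.

```r
# Made-up predictor and response.
x <- c(9.2, 9.8, 10.1, 10.6, 11.0, 11.3, 12.1, 12.4)
y <- c(5, 6, 5, 6, 7, 6, 7, 6)

fit <- lm(y ~ x)

# Variation left over after the fit versus variation around the mean.
rss <- sum(residuals(fit)^2)
tss <- sum((y - mean(y))^2)
r2_manual <- 1 - rss / tss

# Should agree with the value reported by summary().
print(c(manual = r2_manual, summary = summary(fit)$r.squared))
```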


Final Plots and Summary

Plot One

Description One

The distribution of residual.sugar appears to be bimodal: there are two groups of sweetness among white wines in all quality groups. This graph is very interesting, on the one hand because the transformation reveals more information, and on the other hand because residual.sugar is highly correlated with other input variables. It is an important input variable that describes the dataset. For predicting the quality of wines it is not usable.

Plot Two

Description Two

The plot shows the correlation matrix of all parameters in the dataset. It is a very important graph for further analysis because it shows the correlation coefficient for each pair of variables in the form of a matrix. This graph is included among the final plots because it gives a quick overview of which input variables can be used for a linear model. For example, to predict the quality of wine we see that alcohol has the highest correlation with 0.44, and on the other hand you can see that volatile.acidity is the only input variable that has little correlation with the other variables.

Plot Three

Description Three

There is a high negative correlation of -0.78 between density and alcohol. White wines with more alcohol have less sugar than wines with less alcohol, and wines with a density around 1 have more residual.sugar than wines with a density around 0.99. This graph is shown because these three input variables describe the dataset best; unfortunately, for building a linear model to predict quality they are not a good choice because of the high correlation among themselves.


Reflection

It was a nice experience to work with this dataset. At the beginning I was happy to deal with no factors; on a second look I realized that the variable quality is a factor that is stored as an integer. First I plotted all histograms to get an idea of the dataset. The quality scale has eleven possible grades (0 to 10); only seven of them occur in this dataset, and only three (5, 6 and 7) have more than 800 observations each. The second part analyzes the correlations; I was very surprised that alcohol and quality have a high correlation with the same parameters. That makes it very hard to find the parameters for a linear model. At first I thought pH, sugar and alcohol were the main parameters, but the data tell a different story. By choosing the parameters alcohol and volatile.acidity I created a linear model with an R-squared of 0.24, which is a very bad result. Reasons for that could be that the dataset has too few values in some quality categories to give a representative result, or that the sensory quality rating does not fit the chemical parameters. It would be very interesting to get a dataset with more data per quality.

During the project it was very challenging to always get the plots I expected. For example, when creating the third final plot it was not possible to distinguish the classes by color because of overplotting. By transforming the data with the log function I got a clearer picture. Another challenge was making the correlation matrix readable, with the correct size for the x- and y-axis labels.

Finally, it was not possible to find a satisfactory model to predict the quality of white wines with linear regression analysis. Another approach could be a decision tree instead of linear regression.